Fix/remove duplicates in rcm #13

Open

noctillion wants to merge 2 commits into master from fix/remove_duplicates_in_rcm

+43 −13

Contributor

noctillion commented Nov 11, 2024

Updates the code to use the refreshed GFF3 release, and refactors it to prevent duplicate geneId entries in the generated transcriptomic raw count matrix.

noctillion added 2 commits

November 11, 2024 10:54


          rf generate and assign matrices

4609d71


          fix rcm with duplicated geneId

1b8aa45

noctillion requested a review from v-rocheleau

November 11, 2024 18:26

v-rocheleau requested changes

v-rocheleau left a comment

Resulting matrices are good, thanks!
Some small code requests.

transcriptomics/transcriptomics_matrix_generator.py

Comment on lines 70 to 74

    
                  def generate_gene_names(self, url):
                      file_name = os.path.basename(urlparse(url).path)
                      self.gene_names = self.download_gff(url, file_name)
                      self.num_genes = len(self.gene_names)
                      if not hasattr(self, 'gene_names') or not self.gene_names:
                          self.download_gff(url, file_name)
                      return self.gene_names

v-rocheleau Nov 12, 2024

It seems this function is no longer used, since process_gff sets self.gene_names. Please remove it.

transcriptomics/transcriptomics_matrix_generator.py

@@ -45,23 +45,32 @@ def load_sample_info(self, json_file_path):
                           self.treatments = [item['Treatment'] for item in sample_info]
                           self.experiment_id = [item['ExperimentID'] for item in sample_info]
-                  def download_gff(self, url, file_path):
+                  def download_gff(self, url, file_path):

v-rocheleau Nov 12, 2024

Would rename to download_and_process_gff, since it does more than just download now.

transcriptomics/transcriptomics_matrix_generator.py

+                      self.gene_names = gene_info['GeneName'].tolist()
+                      self.num_genes = len(self.gene_names)
+                      return output_file_csv, file_path

v-rocheleau Nov 12, 2024

The return values are not used.

transcriptomics/transcriptomics_matrix_generator.py

                       if not os.path.exists(file_path):
                           subprocess.run(['wget', '-O', file_path, url], check=True)
+                      self.process_gff(file_path)
+                  def process_gff(self, file_path):

v-rocheleau Nov 12, 2024

Would rename to something like write_gene_info, since the function now saves the gene lengths in a csv.

transcriptomics/transcriptomics_matrix_generator.py

+                      print(f"Gene lengths have been saved to {output_file_csv}.")
+                      self.gene_names = gene_info['GeneName'].tolist()
+                      self.num_genes = len(self.gene_names)

v-rocheleau Nov 12, 2024

In this kind of situation, when you need to hold a list of values (gene_names) and make decisions based on the length of the list, it is best to simply get the length directly from the property when you need it. With self.gene_names, all we need to get the number of genes is len(self.gene_names).

Otherwise, you have to maintain the state of 2 closely linked properties instead of just 1, which is not ideal for maintainability.
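
One way to follow this suggestion while keeping existing `num_genes` call sites working is a read-only property derived from the list. This is only a minimal sketch: `MatrixGenerator` is a hypothetical stand-in for the class in transcriptomics_matrix_generator.py, not its real definition.

```python
class MatrixGenerator:
    """Hypothetical stand-in for the generator class under review."""

    def __init__(self):
        # The single source of truth: the list of gene names.
        self.gene_names = []

    @property
    def num_genes(self):
        # Computed on access, so it cannot drift out of sync with gene_names.
        return len(self.gene_names)


generator = MatrixGenerator()
generator.gene_names = ["geneA", "geneB", "geneC"]
count_after_load = generator.num_genes

# Changing the list is enough; there is no second attribute to update.
generator.gene_names.pop()
count_after_pop = generator.num_genes
```

Calling `len(self.gene_names)` directly, as suggested above, is equally valid; the property only helps if other code already reads `num_genes`.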

transcriptomics/transcriptomics_matrix_generator.py

Comment on lines +103 to +108

+                      if not self.gene_names:
+                          raise ValueError("No gene names available for count matrix generation.")
+                      unique_gene_names = list(filter(None, set(self.gene_names)))
+                      if not unique_gene_names:
+                          raise ValueError("No valid gene names available after filtering duplicates and empty entries.")

v-rocheleau Nov 12, 2024

I don't think this is needed, since the values are already deduplicated when the self.gene_names property is set.
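
A small sketch of why the extra filtering is redundant, using the same `dropna().drop_duplicates(subset='GeneName')` chain as the diff. The sample rows are made up for illustration and do not come from the GFF3 file.

```python
import pandas as pd

# Hypothetical rows standing in for the parsed GFF3 gene records.
genes = pd.DataFrame({
    "GeneName": ["geneA", "geneB", "geneA", None],
    "length": [100, 250, 100, 75],
})

# Same chain as in the diff: drop rows with missing values, then keep the
# first row seen for each GeneName.
gene_info = genes[["GeneName", "length"]].dropna().drop_duplicates(subset="GeneName")
gene_names = gene_info["GeneName"].tolist()

# The later defensive filter removes nothing; at most it changes the order,
# because a set does not preserve insertion order.
refiltered = list(filter(None, set(gene_names)))
```

One caveat: `dropna()` does not remove empty strings, so this holds only as long as GeneName is never an empty string in the source data.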

v-rocheleau reviewed

transcriptomics/transcriptomics_matrix_generator.py

                       genes.loc[:, 'length'] = genes['end'].astype(int) - genes['start'].astype(int) + 1
-                      genes = genes['GeneName'].dropna().tolist()
-                      return genes
+                      gene_info = genes[['GeneName', 'length']].dropna().drop_duplicates(subset='GeneName')

v-rocheleau Nov 19, 2024

Columns should be GeneID and GeneLength for TDS
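
A possible way to apply this with pandas is a column rename before the CSV is written. The mapping below (GeneName to GeneID, length to GeneLength) is an assumption based on this comment, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical frame with the column names used in the current diff.
gene_info = pd.DataFrame({
    "GeneName": ["geneA", "geneB"],
    "length": [100, 250],
})

# Assumed mapping to the column names requested for TDS.
renamed = gene_info.rename(columns={"GeneName": "GeneID", "length": "GeneLength"})
```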
